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Abstract 

Modern statistical applications involving large data sets have focused attention on 
statistical methodologies which are both efficient computationally and able to deal with 
the screening of large numbers of different candidate models. Here we consider com- 
putationally efficient variational Bayes approaches to inference in high-dimensional het- 
eroscedastic linear regression, where both the mean and variance are described in terms 
of linear functions of the predictors and where the number of predictors can be larger 
than the sample size. We derive a closed form variational lower bound on the log 
marginal likelihood useful for model selection, and propose a novel fast greedy search 
algorithm on the model space which makes use of one-step optimization updates to the 
variational lower bound in the current model for screening large numbers of candidate 
predictor variables for inclusion/exclusion in a computationally thrifty way. We show 
that the model search strategy we suggest is related to widely used orthogonal matching 
pursuit algorithms for model search but yields a framework for potentially extending 
these algorithms to more complex models. The methodology is applied in simulations 
and in two real examples involving prediction for food constituents using NIR technology 
and prediction of disease progression in diabetes. 
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1 Introduction 



Consider the heteroscedastic linear regression model 

yi = xj(3 + aiei, i = l,...,n (1) 

where i/i is a response, Xi = {xii,...,Xip)'^ is a corresponding p- vector of predictors, f3 = {f3i,...,/3p)'^ 
is a vector of unknown mean parameters, ej~iV(0,l) are independent errors and 

logo-,^ = zfa, 

where Zi = {zii,...,Zig)'^ is a g-vector of predictors and a = {ai,...,aq)'^ are unknown variance 
parameters. In this model the standard deviation cTj of yi is being modeled in terms of the 
predictors zf, this heteroscedastic model may be contrasted with the usual homoscedastic 
model which assumes is constant. We take a Bayesian approach to inference for this model 
and consider a prior distribution on 6 = {(5^ ,a^)'^ of the form p{6)=p{P)p{a) withp(/3) and 
p{a) both normal, A^(/i°,S°) and A^(/i° ,S° ) respectively. It is possible to consider hierarchical 
extensions for the priors on p{f3) and p{a) but we do not consider this here. 

We will consider a variational Bayes approach for inference (see Section |2] for the details). 
The term variational approximation refers to a wide range of different methods where the idea 
is to convert a problem of integration into an optimization problem. In Bayesian inference, 
variational approximation provides a fast alternative to Monte Carlo methods for approxi- 
mating posterior distributions in complex models, especially in high-dimensional problems. 
In the heteroscedastic linear regression model, we will consider a variational approximation 
to the joint posterior distribution of (3 and a as q{f3,a) =q{f3)q{a), where q{(3) and q{a) are 
both normal densities, A^(/i^,S^) and A^(/iJ?,,S^) respectively. It is also possible to give a vari- 
ational treatment in which independence is not assumed between /3 and a (John Ormerod, 
personal communication) but this complicates the variational optimization somewhat. We at- 
tempt to choose the parameters in the variational posterior yU^, /i^, and to minimize the 
Kullback-Leibler divergence between the true posterior distribution p{f3,a\y) and q{(3,a). This 
results in a lower bound on the log marginal likelihood \ogp{y) - a key quantity in Bayesian 
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model selection. The first contribution of our paper is the derivation of a closed form for the 
lower bound and the proposal of an iterative scheme for maximizing it. This lower bound 
maximization plays a crucial role in the variable selection problem discussed in Section [31 

Variable selection is a fundamental problem in statistics and machine learning, and a large 
number of methods have been proposed for variable selection in homoscedastic regression. 
The traditional variable selection approach in the Bayesian framework consists of building 
a hierarc hical Bayes model and using MCMC algorithms to est i mate posterio r mode l prob- 



abilities ( iGeorge and McCuUoch 



1993 



Smith and Kohn 



1996 



Raftery et al. 



1997!) • This 



methodology is computationally demanding in high-dimensional problems and there is a need 



for fast alternatives in some applications. In high-dimensiona. 



include the family of greedy algo rithms (iTropp 



pursuit flMallat and Zhang 



to the Lasso (|Tibshiranil. 



fcoOTf ) 



Efron et al. 



2004 : 



Zhang 



settings, alternative approaches 



20091 ) . also known as matching 



19931 ) in signal processing. Greedy algorithms are c 



9961 ) and the 



(120041 ) and 



ARS algorithm (lEfron et al. 



20041 ). See 



osely related 



Zhao and Yu 



ZhangI (120091 ) for excellent comparisons of these families of al- 



gorithms. In the statistical context, greedy algorithms have been proven to be very efficient 



for variable selection in linear regression under t 



the variance is assumed to be constant (jZhang 



le ass umption of homoscedasticity, i.e. where 



20091 ). 



In many applications the assumption of constant variance may be unrealistic. Ignoring het- 
eroscedasticity may lead to serious problems in inference, such as misleading assessments of sig- 
nificance, poor predictive performance and inefficient estimation of mean p arameters. In som e 
cases, learning the st r uctur e in t 



Carroll and Ruppert 



le variance may be t he primary goal. See 



(119881 ) and 



Chan et al. 



Ruppert et al. 



(]2006f ). 



(120031), Chapter 14, for a more detailed discus- 



sion on heteroscedastic modeling. Despite a large number of works on heteroscedastic linear 



regression and overdispersed 
to depend on the covariates ( 



1987 



Smyth 



1989 



generalized 



Efron 



Yee and Wild 



1986 



1996 



inear models in which ove r dispersion is modelec 



Nelder and Pregibon 



1987 



Rigbv and Stasinopou 



able selection seem to be somewhat overlooked. 



Yau and Kohnl tOQch and 



Davidian and Carroll 



20051) , methods for vari- 



Chan et al. 



consider Bayesian variable selection using MCMC computational approaches in heteroscedas- 
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tic Gaussian model s. They disc u ss ex tensions involving flexible modeling of the mean and 



variance functions. 



Cottet et al. 



( 120081 ) consider extensions to overdispersed generalized lin- 



ear and generalized additive models. These approaches are computationally demanding in 



high-dimensional set tings. A genera . 



is also considered by 



and flexi ble framework for modeling over dispersed data 



Yee and WildI (119961 ) and lRigby and StasinopoulosI (120051 ). Methods for 



model selection, however, are less well developed. A common approach is to use informa- 
tion criteria such as generalized AIC a r id BI G together with forward stepwise methods (see. 



for example. 



Rigby and StasinopoulosI (120051 ) . Section 6). We compare our own approaches 



to such methods later. A main contribution of the present paper is to propose a novel fast 
greedy algorithm for variable selection in heteroscedastic linear regression. We show that the 
proposed algorithm is in homoscedastic cases similar to currently used methods while having 
many attractive properties and working efficiently in high- dimensional problems. An efficient 
R program is available on the authors' websites. 



In Section m we apply our algorithm to the analysis of the diabetes data (lEfron et al. 



20041 ) 



using heteroscedastic linear regression. This data set consists of 64 predictors (constructed 
from 10 input variables for a "quadratic model") and 442 observations. We show in Figure [T] 
the estimated coefficients corresponding to selected predictors as functions of iteration steps in 
our algorithm, for both the mean and variance models. The algorithm stops after 11 forward 
selection steps with 8 and 7 predictors selected for the mean and variance models respectively. 

The rest of the paper is organized as follows. The closed form of the lower bound and the 
iterative scheme for maximizing it are presented in Section [2l We present in Section [3] our 
novel fast greedy algorithm, and compare it to existing greedy algorithms in the literature 
for homoscedastic regression. In Section H] we apply our methodology to the analysis of two 
benchmark data sets. Experimental studies are presented in Section |5l Section |6] contains 
conclusions and future research. Technical derivation is relegated to the Appendices. 
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Mean model 



Variance model 



C5 




Steps steps 



Figure 1: Solution paths as functions of iteration steps for analyzing the diabetes data using 
heteroscedastic linear regression. The algorithm stops after 11 iterations with 8 and 7 predic- 
tors selected for the mean and variance models respectively. The selected predictors enter the 
mean (variance) model in the order 3, 12, 28 (3, 9, 4). 

2 Variational Bayes 



We now give a brief introduction t o the variational app r oximation met hod. For a more de- 



t ailed exposition see , for 



example. 



Jordan et al. 



fll999[ ) 



Bishop 



(120061) Chapter 10, or see 



Ormerod and Wandl (120091 ) for a statistically oriented introduction. The term variational ap- 
proximation refers to a wide range of different methods where the idea is to convert a problem 
of integration into an optimization problem. Here we will only be concerned with applications 
of variational methods in Bayesian inference and only with a particular approach sometimes 
referred to as parametric variational approximation. Write 6 for all our unknown parameters. 



p{9) for the prior distribution and p{y\9) for the likelihood. For Bayesian inference, decisions 



5 



are based on the posterior distribution p{6\y) (xp{6)p{y\6) . A common difficulty in applications 
is how to compute quantities of interest with respect to the posterior. These computations 
often involve the evaluation of high-dimensional integrals. Variational approximation pro- 
ceeds by approximating the posterior distribution directly. Formally, we consider a family 
of distributions ^(^|A) where A denotes some unknown parameters, and attempt to choose A 
so that q{0\\) is closest to p{0\y) in some sense. In particular, we attempt to minimize the 
KuUback-Leibler divergence 

/.o.i|||„«|A).. 
with respect to A. Using the identity 

logp(y) = /log?;<«,(0|A)d9+/logi|^,(9|A)<iO, (2) 

where p{y) = jp{0)p{y\9)d9, we see that minimizing the Kullback-Leibler divergence is equiv- 
alent to the maximization of 



/ 



Here ([3]) is a lower bound (often called the free energy in physics) on the log marginal likelihood 
\ogp{y) due to the non- negativity of the Kullback-Leibler divergence term in ([2]). The lower 
bound ([3D, when maximized with respect to A, is often used as an approximation to the log 
marginal likelihood \ogp{y). The error in the approximation is the Kullback-Leibler divergence 
between the approximation q{9\\) and the true posterior. The approximation is useful, since 
logpiy) is a key quantity in Bayesian model selection. 

For our heteroscedastic linear model, the lower bound ([3]) can be expressed as 

L = Ti+T2 + T3, 

where 

Ti = / \og\p{l3,a)]q{(3)q{a)d(3da, 



\og[p{y\f3, a)]q{(3)q{a)d(3da, 
T-i = - I \og[q{(3)q{a)]q{(3)q{a)d(3da. 



We show (see the Appendix A) that these three terms, which are all expectations with respect 
to the (assumed normal) variational posterior, can be evaluated analytically. Putting the 
terms together we obtain that the lower bound ([3]) on the log marginal likelihood is 

L = ^-^log27r + ilog|S^S°-^| + llog|S^SrV^tr(S°^^^^^ 

-^tr(sr^s^) - i(/.^ - f.ifj:y\fil - 4) -lif^l- f^'af^'^if^l - ^l) 

This needs to be maximized with respect to /x^, /x^, S^, T,'^. We consider an iterative scheme 
in which we maximize with respect to each of the blocks of parameters /i^, /x^, S^, with 
the other blocks held fixed. 

Write X for the design matrix with ith row xj and D for the diagonal matrix with ith 
diagonal element 1 /exp{zf fi^^ — ^zfT,'^Zi) . Maximization with respect to //^ with other terms 
held fixed leads to 

= [X^DX + S° ' (S° - + X^Dy) . 
Maximization with respect to with other terms held fixed leads to 

E}={j:y' + x^Dxy\ 

Handling the parameters fi^, and in the variational posterior for a is more complex. 
We proceed in the following way. If no parametric form for the variational posterior q{a) 
is assumed (that is, if we do not assume that q{a) is normal) but only assume the factor- 



ization q(9) = q(0)q 



a) the n the optimal choice for q{a) for a given q{/3) = A^(/i^,Sp is (see 



Ormerod and Wandl (120091 ). for example) 

g(a)ocexp(i?(log[p(%(|/|^)])), (5) 

where the expectation is with respect to q{(3). Similar to the derivation of the lower bound 
(jll), it is easy to see that 

g(a)ocexp --2^2;, —^7^. ^{o^ - f^a) K (o^ - f^a, 

\ 1=1 1=1 ^ * ' 



which takes the form of the posterior (apart from a normahzation constant) for a Bayesian 
generahzed hnear model with gamma response and log link, coefficient of variation a/2, and 
responses i^j = — xf /x^^+xf S^Xj with the log of the mean response being zfa. The prior 
in this gamma generalized linear model is A^(/x°,S°). If we use a quadratic approximation 
to logg(a) then this results in a normal approximation to q{a). We choose the mean and 
variance of the normal approximation simply by the posterior mode and the negative inverse 
Hessian of the log posterior at the mode for the gamma generalized linear model described 
above. The computations required are standard ones in fitting a Bayesian generalized linear 
model (see Appendix B). Write Z for the design matrix in the variance model with ith row 
zf and write W{a) (as a function of a) for the diagonal matrix dia.g{^Wiexp{—zfa)). With 
/i^ the posterior mode, we obtain for the expression 

Our optimization over //^ and T,'^ is only approximate, so that we only retain the new values 
in the optimization if they result in an improvement in the lower bound (jlj). The advantage of 
our approximate approach is the closed form expression for the update of once /i^ is found, 
so that explicit numerical optimization for a possibly high-dimensional covariance matrix is 
avoided. 

The explicit algorithm for our method is the following. 
Algorithm 1: Maximization of the variational lower bound. 

1. Initialize parameters /x^, S^. 

2. fil^ (^X'^DX + T.y^^ ^ (rPf^ii^p+X^Dy^ where D is the diagonal matrix with ith 
diagonal entry l/exp(2;f/i^ — 1/2^;^ S^^;^). 

4. Obtain /x^ as the posterior mode for a gamma generalized linear model with normal 
prior A^(/i°,S°), gamma responses Wj = (?/j— xf /x^^+xf S^x,, coefficient of variation \/2 
and where the log of the mean is zja. 

8 



5 




where W is diagonal with ith diagonal element Wjexp(— yU^)/2. 



6. If the updates done in steps 3 and 4 do not improve the lower bound (jl]) then their old 
values are retained. 

7. Repeat steps 2-6 until the increase in the variational lower bound (jlj) is less than some 
user specified tolerance. 

For initialization, we first perform an ordinary least squares (OLS) fit for the mean model to 
get an estimate (3 of /3. Then we take the residuals from this fit, say rj = (yi — xj^)^, and do 
an OLS fit of logrj to the predictors Zi to obtain our initial estimate of /i^. The initial value 
of is then set to the covariance matrix of the least squares estimator. When the OLS fits 
are not valid, some other method such as the Lasso can be used instead. The application of 
this algorithm to the problem of variable selection in Section |3] always involves only situations 
in which the above OLS fits are available. 

We mention one further extension of our method. We have assumed above that the prior 
covariance matrices and are known. Later we will assume S^ = cr^/ and S[^ = cr^/ where 
/ denotes the identity matrix and and o"^ are scalar variance parameters. We further 
assume that /i^ = and /i^ = 0. It may be helpful to perform some data driven shrinkage so 
that cr^ and cr^ are considered unknown and to be estimated from the data. Our lower bound 
(jlj) can be considered as an approximation to \ogp{y\a'^,a'^), and the log posterior for 0"^,cr^ 
is apart from an additive constant 



If we assume independent inverse gamma priors, IG{a,b), for o"^ and cr^, and if we replace the 
log marginal likelihood by the lower bound and maximize, we get 



logp((T^, (tI) + \ogp{y\al, al). 




a + 1 + p/2 



and 



, _ &+l;,^V^ + ltr(S^) 
a + l + q/2 

These updating steps can be added to the Algorithm 1 given above. 
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3 Model selection 



In the discussion of the previous sections the choice of predictors in the mean and variance 
models was fixed. We now consider the problem of variable selection in the heteroscedastic 
linear model, and the question of computationally efficient model search when the number of 
candidate predictors is very large, perhaps much larger than the sample size. In Sections 1 
and 2 we denoted the marginal likelihood by p{y) without making explicit conditioning on the 
model but now we write p{y\m) for the marginal likelihood in a model m. If we have a prior 
distribution p(m) on the set of all models under consideration then Bayes' rule leads to the 
posterior distribution on the model given by p{m\y) o<:p{m)p{y\m) . We can use the variational 
lower bound on logp{y\m) as a replacement for \ogp{y\m) in this formula as one strategy for 
Bayesian variable selection when p(y|m) is difficult to compute. We follow that strategy here. 
For a more thorough review of the Bayesian approach to model selection see, for example. 



O'Hagan and Forsterl (120041 ) 



U sing t 



2003 



le maximized low er bound is a popular approach for model selection (IBeal and Ghahramani 



2006 



WuetaL 



20101). The error in using the lower bound to approximate \ogp{y\m) is 
the Kullback-Leibler divergence between the true posterior p{6\y) and the variational distri- 
bution q{6\X). The true posterior in our model has the structure of a product of two normal 
distributions and the variational distribution we use is also a product of two normals. There- 
fore, it can be expected that the minimized KL divergence is small, thus leading to a good 
approximation. The experimental study in Section 5 suggests that the maximized lower bound 
is very tight. 

Before presenting our strategy for ranking variational lower bounds, we discuss here the 
model prior. Suppose we have a current model with predictors Xi, i&C C -D = {!,... ,p} in the 
mean model and Zj, iGl^C-E = {l,...,g} in the variance model. The subsets C and V give indices 
for the currently active predictors in the mean and variance models. Let vrf (ttJ) be the prior 
probability for inclusion of Xi {zj) in the mean (variance) model, and write vr^ = (vrf ,...,7r^)"^. 
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n'^ = {n^,...,TT^)'^ . We assume that the inclusions of predictors are independent a priori with 

P(Ck^) = n < 11(1 - vrf ), piVlnn = n 11(1 - TTj), 
iec i^c j&v j^v 

and the prior probabihty of a model m with index sets C and V in its mean and variance 

models is assumed to be 

p{m) = p{C, V\7i^, n") = p{C\7i^)p{V\7i''). (6) 

If no such detailed prior information is available for each individual predictor (which is the 
situation we consider in this paper), one may assume that vrf = ... = 7r^ = 7r^ and 7r^ = ... = 7r^ = 7ro- 
(we note a slight abuse of notation here). Then 

PiCK) = vrjf 1(1 - vr,)^H^I p(\/|vr.) = vrjl(l - 7r.)«-l^l, (7) 

where hyperparameters tt^, tTo- G [0,1] are user-specified. One can encourage parsimonious 
models by setting small (< 1/2) vr^ and tTo-. The smaller the vr^ and tTo-, the smaller prior 
probabilities are put on complex models. By setting 7r^ = 7ro- = 1/2, one can set a uniform prior 
on the models. Another option is to put uniform distributions on vr^ and n^. Then 

p{C) = [\{C\n,)dn, oc (i^i^ , p{V) = j^'piVMdn^ (^^^^ 







Chen and Chenl (120081 ) 



This prior agrees with the one used in the extended BIG proposed by 
It has the advantage of requiring no hyperparameter while still encouraging parsimony. We 
recommend using this as the default prior. 

We now consider adding a single variable in either the mean or the variance model, and 
then a one-step update to the current variational lower bound in the proposed model as a 
computationally thrifty way of ranking the predictors for possible inclusion. In our one-step 
update, we consider a variational approximation in which the variational posterior distribution 
factorizes into separate parts for the added parameter and the parameters in the current model, 
as well as the factorization of mean and variance parameters in Section 2. We stress that this 
factorization is only assumed for the purpose of ranking predictors for inclusion. Once a 
variable has been selected for inclusion, the posterior distribution is approximated using the 
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method outlined in Section 2. Write for the parameters in the current mean model and 
Xc for the corresponding design matrix, and ay for the parameters in the current variance 
model with Zy the corresponding design matrix. Write xci for the ith row of Xc and zyi for 
the 2th row of Zy. 

3.1 Ranking predictors in the mean model 

Let us consider first the effect of adding the predictor Xj, j eD\C, to the mean model. We 
write for the coefficient of Xj and we consider a variational approximation to the posterior 
of the form 

q{e) = qmqWMc'v), (9) 

with g(/3c) = A^(/i^C'^^c)' ^((^v) = N {fily ,i:ly) and = A^(/i^j.,(a^^.)^)- Suppose that we 
have fitted a variational approximation for the current model (i.e. the model without xj) using 
the procedure of Section 2. We now consider fitting the extended model with fi'^^(j,T,1^^,fil^y 
and fixed at the optimized values obtained for the current model, and consider just one 
step of a variational algorithm for maximizing the variational lower bound in the new model 
with respect to the parameters /i^^.,(cr^^)^. In effect for our variational lower bound @, we 
are assuming that the variational posterior distribution for (/3c'^,/3j)^ is normal with mean 
(/^^c^'/^ft)^ and covariance matrix 

Substituting these forms into (jl]) and further assuming /U° = 0, /u[^ = 0, S°=(t|/ and S° =0"^/ 
(see the remarks at the end of Section 2), we obtain the lower bound 

, , ^11, H^f iA 4K.)^ + 4(/^ft-)^-2^^./^^.(^^-^g./^^c) 

2 + 2°^ al 2a} 2a^ 2^. {z^^^ly - hzlKyZ^v) 

(10) 

where Lqm is the previous lower bound for the current model without predictor j. Here we are 
writing Xij for the value of predictor j for observation i. Optimizing the above bound with 
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respect to /i^^- and (o"^-,)^ and writing /i^^- and {^IjY for the optimizers gives 

^ exp (^i/i^v- - hzviKvZv^) I ' 1^ exp (4^//^^ - \zlJZyZvi) 



and 



^ § exp {zl^iily - \zl^T,lyZyi) I ■ ^^^^ 



Substituting these back into the lower bound (fTOj) gives 



LoiA + ^ log + i^j^- (13) 



If the variance model contains only an intercept, this result agrees with greedy selection 



algorithms where predictors are ranked accordi ng to t 



l e corr elation between a predictor and 



Zhang 



the residuals from the current model (see, e.g. 

detail in Section 1231 Later we write the optimized value of f llOp as Lj'\C,V), the superscript 
M means the lower bound associated with the model for mean. 



( I2OO9I )). We will discuss this point in 

Ml 



3.2 Ranking predictors in the variance model 

So far we have considered only the addition of a predictor in the mean model. We now attempt 
a similar analysis of the effect of inclusion of a predictor in the variance model. With the 
mean model fixed, suppose that we are considering adding a predictor Zj, j ^ E\V, to the 
variance model. We consider a normal approximation to the posterior q{6) = q{f3c')(l{<yv)(l{<^j) 
with g(/3c) = A^(/i^C'^^c)' li<^v) = N{iJ,ly,J:ly) and q{aj) = N{iJ,lj,{aljy). The variational 
lower bound is 



L +VSo.^<-)' V. 




{iyi-xci^J'}cy+xci^c^c^), (14) 

where Lqm is the lower bound for the current model without predictor Zj. To obtain good 
values for /x^^- and (cr^j)^, we use an approximation similar to the one used for the variance 
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parameters in Section 2. If we do not assume a normal form for q{aj) but just the factorization 
q{0) = q{/3c)q{o:v)q{aj) and with the current q{/3c) and q{av) fixed, then the optimal q{aj) is 

q{aj) (X exp{E {log p{aj) + \ogp{y\9))), 

where the expectation is with respect to q{(3c)q{c(v)- We have that 

( \ 1 tt^ n 1 " 

E{\ogp{a,)+\ogp{y\e)) = E ( -- log27r - - log - ^ - ^ log 27r - - ^ z^.av 

" i=i 



2 2 ^ exp (^.av- + ^a^O 

1 1 q;^ n 
■- log 27r log cr?, ^ log 27r — 

2 2 ° 20-2 2 2 



~9 ^ ~ 9 2^ f.Yr> (-yT ,,1 J~r~^ l^Tv? ^ V 

^ j=l ^ j=i exp (^^v-i/^aV + %«i - 2^Vi^aV^Vi) 

We make a normal approximation N{fi^^-,{a'^^-Y) to the optimal q{aj) via the mode and 
negative inverse second derivative of f|T5l) . Differentiating with respect to a^, we obtain 



a 

72 



' ^tt 2 ^ exp(z,,a,) exp {z^^fxly - Iz^^ElyZvi) 



Approximating exp{—Zijaj) ^1— Zijaj (i.e. using a Taylor series expansion about zero), setting 
the derivative to zero and solving gives 



1=1 / \ 1=1 



To get more accurate estimation of the mode, some optimization procedure may be used here 
with f|T6|) used as an initial point. In our R implementation, the Newton method was used 
because (fTSjl has its second derivative available in a closed form (see (1T7|) below). We found 
that (fT6l) is a very good approximation as the Newton iteration very often stops after a small 
number of iterations (with a stopping tolerance as small as 10~^°). 

Differentiating ( ITSll once more, and finding the negative inverse of the second derivative 



at fi'lj gives 



- + -T '''''' \ (17) 
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We can plug these values back into the lower bound in order to rank different predictors for 
inclusion in the variance model. Once again we note the computationally thrifty nature of 
the calculations. We write the optimized value of f lT^ as L^{C,V), the superscript D means 
the lower bound associated with the model for standard (deviance. 

3.3 Summary of the algorithm 

We summarize our variable selection algorithm below. We write L{C,V) for the optimized 
value of the lower bound (jl]) with the predictor set C in the mean model and the predictor 
set V in the variance model. Write C+j for the set CU{j} and Vj^j for the set l^U{j}. 

Algorithm 2: Variational approximation ranking (VAR) algorithm. 

1. Initialize C and V and set Lopt'-= L{C,V). 

2. Repeat the following steps until stop 

(a) Store Coid: = C, Void-=V. 

(b) Let J* = aTgmaxj{Lf{Cy) + \ogp{C+,y)}. If L(C+,*y) + logp(C+,*y) > Lopt + 
logp(Cy)} then set C:=C+,,, 1,^, = L{C+j,y). 

(c) Let J* = argmax,{Lf (Cy) + logp(Cy+,)}. If L(Cy+,.) +logp(Cy+jO > ^opt + 
logp(Cy) then set V:=V+j,, L,p, = L{Cy+,,). 

(d) If C = Coid and V = Vo\a then stop, else return to (a). 

3.4 Forward-backward ranking algorithm 

The ranking algorithm described above can be regarded as a forward greedy algorithm because 
it considers adding at each step another predictor to the current model. Hereafter we refer 
to this algorithm as forward variational ranking algorithm or fVAR in short. Like the other 
forward greedy algorithms that have been widely used in many scientific fields, the fVAR 
works well in most of the examples that we have encountered. However, a major drawback 
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with the forward selection algorithms is that if a predictor has been wrongly selected then it 
can not be removed anymore. A natural remedy for this is to add a backward elimination 
process in order to correct mistakes made in the earlier forward selection. We present here a 
recipe for ranking predictors for exclusion in the mean and variance models. 

Let C, V be the current sets of predictors in the mean and variance models respectively. 
With j G C, we write C-j for the set C\{j} and consider now the effect of removing the 
predictor Xj to the lower bound. In order to reduce computational burden, we need some way 
to avoid the need to do lower bound maximization for each model C-j when ranking Xj for 
exclusion. Similar as before, we consider a variational approximation using the factorization 
(Q for the variational posterior distribution. Following steps f fT0|) -f lT3|) . we can approximately 
write the lower bound for the current model (i.e. the model contains xj) as the sum of the 
lower bound for the model without xj and a Xj-based term 

L{C,V)^L{C.„V) + T^_^y{j), (18) 

with 

where /t^^-, a^^- are as in (fTTj) and ( fT2l) with C replaced by C-j. All the relevant quantities 
needed in the calculation of ^c^j vU) fixed at optimized values maximizing the lower 
bound for the current model. The subscripts C-j,V emphasize that the quantities needed are 
adjusted correspondingly when the predictor j is removed from the mean model. The most 
plausible candidate for exclusion from the current mean model then is 

f = argmax^.g^{L(C_„ V) + logp(C_„ V)} = argmin^.,^{r^^_^^^(j) - logp(C_„ V)}. (20) 

We now rank the predictors for exclusion in the variance model. Following the arguments in 
Section 13.21 and the above, we can write 

L{C,V)^L{C,V.,) + rE.yAj), (21) 



16 



with 



{{yi-^CilJ'licf+Xci^c^Ci), (22) 

where fi^^j, a'^j are as in f lT6|) - f|T7|l with V replaced by V^j. The most plausible candidate for 
exclusion from the current variance model then is 

f = argmax^.g^,{L(C, V^j) + logp(C, V.,)} = argmin^.g^{rg^_^.(j) - logp(C, V.j)}. (23) 

Algorithm 3: Forward-backward variational approximation ranking algorithm. 

1. Initialize C and V, and set Lopt = L{C,V). 

2. Forward selection: as in Step 2 in Algorithm 2. 

3. Backward elimination: Repeat the following steps until stop 

(a) Store Coid: = C, Vo\d-=V. 

(b) Find J* as in (I2D]). If L(C_j-,,y)+logp(C„^-,,y) >Lopt+logp(C,\/) then set C = C,j*, 

(c) Find J* as in li L{Cy^j,)+\ogp{C,V.j*)> Lopt+\ogp{Cy) then set V = V-j., 

(d) If C = Coid and V = Vo\d then stop, else return to (a). 

Hereafter we refer to this algorithm as fbVAR. 

In some applications where A = Z, it might be meaningful to restrict the search for inclusion 
in the variance model to those predictors that have been included in the mean model. To 
this end, in the forward selection we just need to restrict the search for the most plausible 
candidate j* in Step 2(c) of Algorithm 2 to set C, i.e. j* = argmaXjgc'{Lj^(C,y)+logp(C,V,)}. 
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Also, when considering the removal of a candidate j from the mean model in the backward 
elimination, we need to remove j from the variance model as well if j & i.e. Step 3(b) of 
Algorithm 3 must be modified to 

3(b') Let J* = argmin^.,^{r*f_^_^_^(j)-logp(C^-,y-,)}. If +logp(C_,.,V_,0 > 

Lopt+logp(C,V) then set C = C.,*, V = V.,*, L,^, = L{C^,,y^,.). 

Later we compare with the variable selection approaches for heteroscedastic regression im- 
plemented in the GAMLSS (gen eralized additive model for location, scale and shape) package 



(iRigby and Stasinopoulos 



20051 ). The GAMLSS framework allows modeling of the mean and 
other parameters (like the standard deviation, skewness and kurtosis) of the response distribu- 
tion as flexible functions of predictors. Variable selection is done with stepwise selection using 
a generalized AlC or BIG as the stopping rule. The GAMLSS uses a Fisher scoring algorithm 
to maximize the likelihood for ranking every predictor for inclusion/exclusion rather than only 
the most plausible one as in the VAR algorithm, which leads to a heavy computational burden 
for large-p problems. 



3.5 The ranking algorithm for homoscedastic regression 

In order to get more insight into our VAR algorithm, we discuss in this section the algorithm 
for the homoscedastic linear regression model. In the case of constant variance, the variance 
parameter a now becomes scalar, we rename the quantities S° , as (cr°)^, (c^)^ respectively. 
The optimal choice ([5]) for p{a) becomes 

Using the approximation exp(— a)^l— a, it is easy to see that the mean and variance of the 
normal approximation are 

respectively. We now can replace Steps 4 and 5 in Algorithm 1 by these two closed forms so 
that the computations can be reduced greatly. Similar to the discussion in Section 13. 2^ the 
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Newton method may be used here in order to get a more accurate estimate of the mode. In 
our experience, however, this is not necessary here. 

For the variable selection problem, we now just need to rank the predictors for inclu- 
sion/exclusion in the mean model. Assume that we are using the uniform model prior, i.e. 
p{C,V) =constant, or a model prior as in ([7]), the ranking of predictors then follows the rank- 
ing of lower bounds. We further assume that the design matrix X has been standardized such 
that — optimizer ((^^j)^ in ( !T2|) does not depend on j, and the ranking of the lower 

bound (fT3|) follows the ranking of |X]r=i^«i(?/*~^c'j/^^c) I (^•^- follows the ranking of the ab- 
solute correlation of the predictors with the standardized residuals from the current model). 
This result agrees with frequentist matching pursuit and greedy algorithms where predictors 
are ranked according to the correlat i on between a predictor and the residuals from the cur- 



rent model ( iMallat and Zhang 



1993 



Zhang 



2009 



Efron et al. 



2004 



2004 ) . This is also s imilar to 



Zhao and Yu 



20071 ). 



computationally thrifty path following algorithms ( lEfron et al.l . 

For the existing frequentist algorithms for variable selection, extra tuning parameters, such 
as the shrinkage parameter in penalization procedures, the number of iterations in matching 
pursuit and the stopping parameter e in greedy algorithms, need to be chosen. And their 
performance depends critically on the method used to choose these tuning parameters. Our 
method does not require any extra tuning parameters. The final model is chosen when the 
lower bound (plus the log model prior) i s maximized - a widely used stopping rule in Bayesia n 



model selection with variational Bayes (IBeal and Ghahramani 



2003 



2006 



Wu et al. 



2010h . 



Unlike many commonly used greedy algorithms, our Bayesian framework is able to incor- 
porate prior information (if available) on models and/or to encourage parsimonious models 
desired. Besid e s involving e xtra t uning parameters, penalized estimates are often biased 



( iFriedman 



2008 



Efron et al 



2004[ 1. While our method can penalize non-zero coefficients 
through the prior if desired, it does not rely on shrinkage of coefficients to do variable selec- 
tion, so that in principle it might produce better estimation of non-zero coefficients. Simulation 
studies in Section |5] confirm this point. Note that we do not consider models of all sizes, the 
algorithm stops when important predictors have been included in the model, so that compu- 
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tations in Algorithm 1 just involve matrices with low- dimension. This is another advantage 
which makes our method potentially valuable for variable selection in high-dimensional prob- 
lems. Our experience shows that the VAR algorithm is as fast as the LARS algorithm in 
problems with thousands of predictors. 



4 Applications 



Example 1: biscuit dough data. The biscuit dough data is a benchmark "large p, small 



n" data set that was originally designed and analyzed in lOsborne et al 



( Il984j ). The purpose 



of this study was to investigate the practical benefit of using near-infrared (NIR) spectroscopy 
in the food processing industries. In their experiment, the aim was to predict biscuit dough 
constituents based on near-infrared spectrum of dough samples. The four constituents of 
interest were fat, sucrose, fiour and water. Two data sets (training set and prediction 
or validation set D^) were made up in the same manner in which the percentages of four 
constituents were exactly calculated. These percentages serve as response variables. There 
were 39 samples in the training set and 31 in the prediction set. The NIR spectrum of dough 
pieces was measured at equally spac e d wav elengths from 1100 to 2498 nanometers (nm) in 



steps of 2 nm. Following 



Brown et al. 



(120011 ) , we removed the first 140 and last 49 wavelengths 
because they were thought to contain little useful information. From remaining wavelengths 
ranging from 1380 to 2400 nm, every second wavelength was considered, which increases the 
space step to 4 nm. The final data sets consist of 256 predictors and four responses which 
were treated separately in four univariate linear regression models rather than in a single 
multivariate model. 

The most popular and easiest way to check heteroscedasticity is to use plotting techniques. 
When the OLS fit is valid, plotting studentiz ed residuals against fitted values is a powerful 



technique to use (jCarroU and Ruppert 



we first used the adaptive Lasso (aLasso) of 



19881). In our current case of "large p, small n" 



Zoul (120061 ) to select likely predictors and then 



applied the above technique to the selected predictors. We name this method aLasso-OLS. 
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Figure |2] shows the plots of aLasso-OLS studentized residuals versus fitted values (where 
homoscedasticity was assumed), and also the corresponding plots resulting from our fbVAR 
algorithmic (where heteroscedasticity was assumed) for the response variables sucrose (1^2) and 
water (I4); all were calculated based on the training set. The plots for the other responses 
were not shown because the need for a heteroscedastic model was not visually obvious. We 
can see that in general fitting the homoscedastic regression model to these responses was 
not appropriate here. Looking at the aLasso-OLS plot for Y4, for example, there was clear 
evidence that (absolute values of) residuals decrease when fitted values increase, and the 
heteroscedastic model estimated by the VAR method gave a more satisfying residual plot. 
For the response Y2, the VAR did not select any predictor (except the intercept) for inclusion 
in the mean model, although several predictors were selected for the variance model. This 
"non-null" variance model reflects the heteroscedasticity which is visualizable in the aLasso- 
OLS plot for Y2. The null model for the mean model was quite a surprise, since all the works 
analyzing Y2 assuming the homoscedastic linear model that we are aware of in the literature 
reported non-null models. The aLasso in our analysis selected only one predictor, the 130th. 
Among the plots of all 4 responses against all selected predictors, the plot of Y2 against the 
selected predictor (by the aLasso of course) looked very random compared to the others. This 
in some sense supported visually the null mean model for 1^2- 

We then employed the resulting models to make predictions and used the validation set 
to examine the appropriateness of assuming heteroscedasticity for this biscuit dough data. 

The usefulness of a model was measured by two criteria: one was the mean squared error of 

prediction defined as 

and the other was the partial prediction score 

PPS = -^ -logp{y\x), 



^We did not apply the restriction here, because there was no good reasons to restrict the search for inclusion 
in the variance model to the predictors in the mean model. The search combined both forward and backward 
moves and the uniform model prior (i.e. n^^n^ = 1/2) was used. 
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Figure 2: The biscuit dough data. 
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MSE 


PPS 


aLasso 


VAR 


aLasso 


VAR 


fat 


2.61 


0.09 


1.91 


0.25 


sucrose 


13.56 


14.87 


2.73 


2.77 


flour 


4.43 


0.79 


2.16 


1.37 


water 


0.64 


0.18 


1.20 


0.64 



Table 1: The biscuit dough data: MSE and PPS evaluated on the validation set for the aLasso 
and VAR methods. 

where p{.) is the density estimated under the model. It is understood that the smaller the 
MSE and PPS, the better the model. The MSE and PPS evaluated on the 31 samples of 
the validation set for the aLasso and VAR methods are summarized in Table [TJ Except for 
the case of Y2 (sucrose), the heteroscedastic models estimated by the VAR method had a big 
improvement over the homoscedastic models estimated by the aLasso. The poor predictive 
performance of the VAR (and the aLasso as well) on Y2 was probably due to the reasons 
discussed above: there was no linear relationship between the NIR spectrum and the sucrose 
constituent. 



Brown et al. 



( I2OOII ) using a Bayesian 



This biscuit dough data was also carefully analyzed in[ 
wavelet regression framework. They first used a wavelet transformation to transform the orig- 
inal predictors to wavelet coefficients and then applied a Bayesian (homoscedastic) regression 
approach to regress the responses on the derived wavelet coefficients. Prediction was done 
using Bayesian model averaging (BMA) over a set of 500 likely models, and MSE values were 
reported to be 0.06, 0.45, 0.35 and 0.05 respectively. This methodology seems not comparable 
to ours because (i) it was conducted based on wavelet coefficients rather than the original 
predictors and (ii) prediction was done using BMA rather than a single selected model. 

Because the four response variables are percentages and sum to 100, an anonymous re- 
viewer raised a concern about spurious correlations between them. While this may be a 
concern for a multivariate analysis, we treated the four responses separately in four univariate 
linear regression models rather than in a single multivariate model, so that compositional ef- 



fects would not be a problem here. To justify this, we considered the following transformation 
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(lAitchisonl . Il986l ) of the responses 



^3 



1,2,4. 



'he choice o f 



( iBrown et al. 



or denominator was natural because the flour is a major constituent 
20011 ) ■ After fitting the regression models with these three new responses, 
the fitted and predicted values were transformed back to the original scale via 

100 exp{Ui) . _ . , _ 100 



Y, 



1,2,4 and Y^ = ^ TTTTXT- 



Eexp(?7,) + 1' 

The MSE evaluated on the validation set for the aLasso were 4.71, 13.43, 7.07, 1.45 and for 
the VAR method were 2.16, 5.22, 3.36, 0.57. We did not report PPS because it is not clear 
how to properly calculate PPS in the case of heteroscedastic regression for such tranformed 
data. Comparing to the result in Table [Tj it seems that the above transformation which is to 
account for potential compositi onal effects does not give a positive impact overall. This result 



also agrees with the analysis of 



Brown et al. 



(1200 ih 



Example 2: diabetes data. In the second application we applied the VA R method to 



analy zing a benchmark data set in the literature on progression of diabetes ( lEfron et al. 
20041 ). Ten baseline variables, age, sex, body mass index, average blood pressure and six 
blood serum measurements, were obtained for each of n = 442 diabetes patients, as well as the 
response of interest y, a quantitative measure of disease progression one year after baseline. 
We constructed a (heteroscedastic, if necessary) linear regression model to predict y from 
these ten input variables. In the hope of improving prediction accuracy, we considered a 
"quadratic model" with 64 predictors. We distinguish between input variables and predictors, 
for example, in a quadratic regression model on two input variables age and income, there are 
five predictors (age, income, age x age, income x income and age x income). 

The analysis of the full data set showed clear evidence of heteroscedasticity. See again 
Figure [1] for the solution paths resulting from our VAR algorithm with the uniform model 
prior (where only forward selection was implemented and the search for inclusion in the 
variance model was restricted). The VAR and GAMLSS both selected some predictors to 
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include in the variance model. Furthermore, there was quite a clear pattern in the plot of 
the OLS studentized residuals indicating heteroscedasticity (results not shown). Interestingly, 
when fitting y with only ten input variables as the predictors, diagnostics and the selected 
model by VAR both showed no evidence of heteroscedasticity. This result agreed with the 
homoscedasticity assumption often used in the literature for this diabetes data set. 

To assess predictive performance, we randomly selected 300 instances to form the training 
set, with the remainder serving as the validation set. Of 64 predictors, the VAR selected 13 
to include in the mean model and 12 to include in the variance model, while the GAMLSS 
selected 23 and 7 respectively. Under the assumption of constant variance, the aLasso selected 
43 predictors. On the validation set, the models estimated by aLasso, GAMLSS and VAR 
had PPS of 5.50, 15.93, 5.41 and MSB of 3264.95, 3506.32, 2993.16 respectively. In order to 
reduce the uncertainty in training-validation separation, we recorded the MSE and PPS over 
50 such random partitions, and obtained the averaged MSE for aLasso, GAMLSS and VAR 
of 3715.08 (641.56), 4069.81 (1681.70), 3082.78 (774.85) and the averaged PPS of 5.84 (0.15), 
56.72 (11.52), 5.76 (0.16) respectively. The numbers in brackets are standard deviations. The 
GAMLSS method performed poorly in this example but it should be stressed that we have 
only used the default implementation (i.e. stepwise selection with both forward and backward 
moves and the generalized AIC used as the stopping rule) in the GAMLSS R package. Further 
experimentation with tuning parameters in the information criterion might produce better 
results. 

5 Experimental studies 

In this section, we present experimental studies for our method. We first compare the accuracy 
of our variational approximation algorithm to that of MCMC in approximating a posterior 
distribution. We then compare the VAR method for variable selection to the aLasso and 
GAMLSS in both heteroscedastic and homoscedastic regression. In the examples described 
below, the EBIC prior (jH]) was used as a default prior. This prior has very little impact in low- 
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dimensional cases but considerab 



parsimony ( IChen and Chenl . 



20081). 



e impact in high-dimensional cases in terms of encouraging 



The accuracy of the variational approximation. In this example we demonstrate the 
accuracy of the variational approximation for describing the posterior distribution in a het- 



eroscedastic mo del, without con sidering t he mod e^ 
set described in 



WeisbergI (120051 ). see also 



selection aspects. We considered a data 



SmythI (119891 ). The data were concerned with the 



hydrocarbon vapours which escape when petrol is pumped into a tank. Petrol pumps are 
fitted with vapour recovery systems, which may not be completely effective and "sniffer" de- 
vices are able to detect if some vapour is escaping. An experiment was conducted to estimate 
the efficiency of vapour recovery systems in which the amount of hydrocarbon vapour given 
off, in grams, was measured, along with four predictor variables. The four predictor variables 
were initial tank temperature (xi), in degrees Fahrenheit, the temperature of the dispensed 
gasoline (^2), in degrees Fahrenheit, the initial vapour pressure in the tank (X3), in pounds 
per square in ch, and t he ini tial vapour pressure of the dispensed gasoline (X4), in pounds per 
square inch. iSmythl (ll989i) considers fitting a heteroscedastic linear model with the mean 
model 

= /^ifi'i + P292 + + PiX2 + P->gi2Xi + PaQ'iXi 



and the variance model 



logcr^ = ttO + OiiX2 + 023^4, 



where gi, g2 and are three binary indicator variables for different ranges of Xi and gi2 = 
gi+g2- In fitting the mean model the last three terms are orthogonalized with respect to 
the first three, so that the coefficients of the indicators are simply group means for the cor- 
responding subsets of Xi, and in the variance model X2 and 0:4 were mean centered. We 
considered our variational approximation to the posterior distribution in a Bayesian analysis 
where the priors were multivariate normal with mean zero and covariance 10000/ for both /3 
and a. Figure [3] shows estimated marginal posterior densities for all parameters in the mean 
and variance models. The top two rows show the mean parameters and the bottom row the 
variance parameters. The solid lines are kernel density estimates of the marginal posteri- 



26 



ors constructed from MCMC samples and the dotted lines are the variational approximate 
posterior marginals. The mean and variance from the variational approximation were used 
to define a multivariate Cauchy independence proposal for a Metropolis-Hastings scheme to 
generate the MCMC samples. 100,000 iterations were drawn, with 1,000 discarded as "burn 
in". One can see that for the mean parameters, the variational approximation is nearly exact. 
For the variance parameters, point estimation is very good, but there is a slight tendency for 
the variational approximation to underestimate posterior variances. The final lower bound is 
-326.68, with agreement to two decimal places within the first two iterations and convergence 
after 5 ite rations. Compared to -326 .5, the marginal likelihood computed using the MCMC 



method of 



Chib and Jeliazkovl (l200l[ ). this lower bound is very tight. 



Heteroscedastic case. We present here a simulation study for our VAR method for simulta- 
neous variable selection and parameter estimation in heteroscedastic linear regression models, 
and compare its performance to that of the GAMLSS and aLasso methods. Data sets were 
generated from the model 

y = 2 + x^P + ae^^^^e, (24) 

with /3 = (3, 1.5, 0, 0, 2, 0, 0, 0)"^, e~A^(0, 1). Predictors x were first generated from normal 
distributions A^(0,S) with Sj^ = 0.5'*^-'' and then transformed into the unit interval by the 
cumulative distribution function $(.) of the standard normal. The reason for making the 

1 T - 

transformation was to control the magnitude of noise level, i.e. the quantity ae^^ Let 
(3 = (2, and a= (logcr^,a"^)"^ be the mean and variance parameters respectively, where 

a = (0, 3, 0, 0, —3, 0, 0, 0)-^. Note that the true predictors in the variance model were among 
those in the mean model. This prior information was employed in the GAMLSS and VAR. 

The performance was measured by correctly-fitted rates (CFR), numbers of zero-estimated 
coefficients (NZC) (for both mean and variance models), mean squared error (MSE) of pre- 
dictions and partial prediction score (PPS) averaged over 100 replications. MSE and PPS 
were evaluated based on independent prediction sets generated in the same manner as the 
training set. We compared the performance of the VAR and GAMLSS methods (when het- 
eroscedasticity was assumed) to that of the aLasso (when homoscedasticity was assumed). 
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Figure 3: Estimated marginal posterior densities for coefficients in the mean and variance 
models for the sniffer data. Solid lines are kernel estimates from MCMC samples from the 
posterior and dashed lines are variational approximate marginal posterior densities. 



28 



The simulation results are summarized in Table [2] for various factors sample size n, np (size 
of prediction sets D^) and a. As shown, the VAR method did a good job and outperformed 
the others. 

We also considered a "large p, small n" case in which /3 and a in model were vectors of 
dimension 500 with most of the components zero except /35o = /5ioo = --- = /325o = 5, /33oo = /^350 = 
••• = /3500 = — 5 and aioo = '5;2oo = 5, a3oo = '540o = — 5. The simulation results are summarized in 
Table [3l Note that the GAMLSS is not applicable when n<p, and moreover that in the case 
with n>p and with large p the current implementation version of the GAMLSS is much more 
time consuming compared to the VAR and even not working with p as large as 500, since the 
package was not designed for such applications. We are not aware of any existing methods 
in the literature for variable selection in heteroscedastic linear models for "large p, small n" 
case. 

Homoscedastic case. We also considered a simulation study when the data come from 
homoscedastic models. Data sets were generated from the linear model (1241) with a = 0, i.e. 

y = 2 + x^P + ere 

with predictors x generated from normal distributions A^(0,S) with Sj^ = 0.5'*^-^'. We were 
concerned with simulating a sparse, high-dimensional case. To this end, /3 was set to be a 
vector of 1000 dimensions with the first 5 entries were 5, —4, 3, —2, 2 and the rest were 
zeros. We used the modified ranking algorithm discussed in Section 13.51 with both forward 
and backward moves and the default prior ([8]). The performance was measured as before by 
CFR, NZC and MSE but MSE was defined as the squared error between the true vector /3 and 
its estimate. The simulation results are summarized in Table |H The big improvement of the 
VAR over the aLasso in this example is surprising and probably due to the reasons discussed 
in Section [331 

Remarks on calculations. The VAR algorithm was implemented using R and the code is 
freely available on the authors' websites. The weights used in the aLasso were assigned as 
usual as with fij being the MLE (when p<n) or the Lasso estimate (when p>n) of 13 j. 
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(5.94) 
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0.46 






PPS 


1.06 




0.87 




0.74 




1 


CFR in mean 


56 (4.36) 


36 


(4.06) 


88 


(4.88) 






CFR in var. 


nil 


82 


(5.80) 


100 


(6.00) 






MSE 


2.01 




1.93 




1.92 






PPS 


1.77 




1.52 




1.42 



Table 2: Small-p case: CFR, NZC, MSE and PPS averaged over 100 replications. The numbers 
in parentheses are NZC. The true number of non-zero coefficients in the mean model was 5 
and in the variance model was 6. 



n = np 


<j 


VAR 


aLasso 


CFR in mean 


CFR in var. 


MSE 


PPS 


CFR in mean 


MSE 


PPS 


100 


0.5 


80 (489.75) 


90 (495.90) 


5.40 


1.91 


20 (491.80) 


11.65 


2.66 




1 


70 (489.05) 


65 (495.80) 


20.29 


2.31 


(495.75) 


35.11 


3.28 


150 


0.5 


100 (490.00) 


95 (495.90) 


13.77 


0.85 


40 (491.95) 


20.02 


3.41 




1 


95 (489.95) 


85 (495.85) 


28.97 


1.52 


5 (495.05) 


43.19 


3.69 



Table 3: Large-p case: CFR, NZC, MSE and PPS averaged over 100 replications. The numbers 
in parentheses are NZC. The true number of non-zero coefficients in the mean model was 490 
and in the variance model was 496. 
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CFR (NZC) 


MSB 


n = np 


a 


aLasso 


VAR 


aLasso 


VAR 


50 


1 


(994.42) 


38 (994.34) 


31.21 


17.72 




Z 


(994.54) 


2 (992.36) 


38.21 


33.16 


100 


1 


46 (995.62) 


96 (994.96) 


8.40 


0.09 




2 


16 (996.14) 


32 (993.56) 


11.87 


2.09 


200 


1 


90 (995.10) 


98 (994.98) 


6.34 


0.04 




2 


44 (995.56) 


32 (993.40) 


7.78 


0.62 



Table 4: Homoscedastic case: CFR, MSE and NZC averaged over 100 replications for aLasso 
and VAR. The true number of non-zero coefficients was 995. 

The tuning parameter A was selected by 5-fold cross-validation. The implementation of the 
aLasso and GAMLSS was carried out with the help of the R packages glmnet and gamlss. 



6 Concluding remarks 



We have presented in this paper a strategy for variational lower bound maximization in 
heteroscedastic linear regression, and a novel fast greedy algorithm for Bayesian variable 
selection. In the homoscedastic case with the uniform model prior, the algorithm reduces to 
widely used matching pursuit algorithms. The suggested methodology has proven efficient, 
especially for high- dimensional problems. 

Benefiting from the variational approximation approach - a fast deterministic alternative 
and complement to MCMC methods for computation in high- dimensional problems - our 
methodology has potential for Bayesian variable selection in more complex frameworks. A 
potential research direction is to extend the method to simultaneous variable selection and 
numb er of experts selection in flexible regression d ensity estimation with mixtures of ex- 



perts (iGeweke and Keand . 



2007 



Villani et al. 



20091 ). This research direction is currently in 



progress. Another potential research direction is to extend the method to grouped variable 
selection. 



31 



Appendix A 



Below we write Eq{-) for an expectation with respect to the variational posterior. In the 
notation of Section 1 we have 



and 



Ti 



-^log2vr-llog|E°|-llog|E°| 

-Imp - - 4)) - \mo^ - - f^a)) 

-i^log27r-llog|E°|-ilog|E°| 



i=l \ i=l V « ' 

n 1 J. 1 ^ xJHtxi + {yi — xf^i)"^ 

- -2^°^^--2g^^^"-2gexp(.f,^-|.fE^.) 



n = ^log27r + -log|S^| + -log|S^ 



— log2^ + - - 

^ ^log2. + llog|E^| + llog|E^| + ^. 

In evaluating T2 above we made use of the independence of /3 and a in the variational poste- 
rior and of the moment generating function for the multivariate normal variational posterior 
distribution for a. Putting the terms together, the variational lower bound simplifies to (jlj). 



Appendix B 

Denote by W{a) the diagonal matrix diag(|wjexp(— a)), then 

(a) _ 1 , 

da 



d\ogq{a) 1^ ^ 0-1 



and 



g^logg(a) ^ 0- 
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The Newton method for estimating the mode is as follows. 

• Initialization: Set starting value a^^\ 

• Iteration: For k — 1,2,..., update a'^''^ — a'^''~^^+A~^{a^''~^^)u{a'^''~^^) until some stopping 
rule is satisfied, such as Ha^*^^ — a^^^^^^ll <e with some pre-specified tolerance e. 
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